Average word length | # of sentences | Source |
---|---|---|
5.85 | 14 | Tri gaštanové kone |
5.98 | 14 | Stavba Zeme |
5.98 | 11 | Zápisník jednej lásky |
6.00 | 35 | Everwood |
6.03 | 21 | Piráti Karibiku (2003) |
6.03 | 12 | Syd Barrett |
6.05 | 23 | Žiarlivosť (Odložený prípad) |
6.09 | 24 | Šrí Ánandamáji Má |
6.10 | 11 | Piráti Karibiku: Truhlica mŕtveho muža |
6.13 | 21 | Jeffriesov boj (Odložený prípad) |
6.19 | 10 | Dámsky gambit |
6.21 | 18 | Doom 3 |
6.22 | 13 | Kto chytá v žite |
6.24 | 15 | Zjavenia Panny Márie v Litmanovej |
6.24 | 29 | Výbuch (Odložený prípad) |
6.26 | 16 | The Stone Roses |
6.28 | 11 | Súhvezdie Býk |
6.29 | 15 | O myšiach a ľuďoch |
6.31 | 12 | Frasier |
6.31 | 33 | Stanleyho pohár |
6.31 | 10 | Otto Smik |
6.32 | 10 | Mačka Temminckova |
6.32 | 13 | Robbie Williams |
6.32 | 12 | Čarodejnice (seriál) |
6.33 | 12 | Molnárovci v kríze (Mafstory) |
6.33 | 16 | Zázrivá (okres Dolný Kubín) |
6.34 | 13 | Jana Eyrová |
6.35 | 11 | Atlanta Thrashers |
6.36 | 18 | Dekameron |
6.37 | 26 | Usain Bolt |

Average word length | # of sentences | Source |
---|---|---|
8.80 | 10 | Energetický medzipriestor |
8.59 | 12 | Škandinávska architektúra |
8.58 | 20 | Digitálna knižnica |
8.56 | 12 | Gromovov–Wittenov invariant |
8.54 | 31 | Počítač |
8.54 | 10 | Sad Janka Kráľa |
8.49 | 10 | Toluén |
8.45 | 13 | Ozvučnica |
8.44 | 12 | Teória chaosu |
8.42 | 17 | Mikrokernel |
8.41 | 12 | Geológia Českého masívu |
8.41 | 12 | Komunitná záhrada |
8.39 | 13 | Súborový systém |
8.37 | 22 | Nesteroidné antiflogistikum |
8.36 | 10 | Schizofrénia, schizotypové poruchy a poruchy s bludmi (MKCH-10) |
8.34 | 20 | Špeciálna teória relativity |
8.33 | 10 | Heterosexualita |
8.33 | 10 | Superpočítač |
8.30 | 12 | Burda (pohorie) |
8.30 | 24 | Kreativita |
8.29 | 10 | Národná banka Slovenska (budova) |
8.29 | 15 | Južná Afrika (štát) |
8.27 | 15 | Antivírusový softvér |
8.26 | 16 | Schizoafektívna porucha |
8.26 | 10 | Ivan Stanislav |
8.24 | 20 | Slovensko |
8.24 | 11 | Reklamná agentúra |
8.23 | 11 | Poruchy osobnosti a správania dospelých (MKCH-10) |
8.20 | 10 | Optická chyba |
8.20 | 14 | Metrológia |

The problem addressed in this subsection, and its results, are similar to 6.4.1.1, but the focus is now on average word length rather than average sentence length.
The measured average word length depends strongly on tokenization. A typical tokenizer might split the string “28.06.2005” into the five parts “28 . 06 . 2005”, with an average length of two. To avoid this, the number of words in a sentence is counted as 1 + (number of blanks in the sentence). The query below computes the resulting value per source, keeping only sources with at least 10 sentences.
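The effect of tokenization on the date example can be illustrated with a minimal Python sketch. The function `naive_tokens` is a hypothetical punctuation-splitting tokenizer written only for this comparison; it is not the tokenizer used for the corpus. `count_words_by_blanks` follows the 1 + (number of blanks) rule stated above.

```python
import re


def count_words_by_blanks(sentence):
    """Word count as 1 + number of blanks, as described in the text."""
    return 1 + sentence.count(" ")


def naive_tokens(text):
    """Hypothetical tokenizer: splits off every punctuation mark.

    Used only for comparison; not the corpus tokenizer.
    """
    return re.findall(r"\w+|[^\w\s]", text)


date = "28.06.2005"

# Punctuation-splitting view: five short tokens.
tokens = naive_tokens(date)
naive_avg = sum(len(t) for t in tokens) / len(tokens)

# Blank-counting view: the whole date is a single word.
blank_avg = len(date) / count_words_by_blanks(date)

print(tokens, naive_avg, blank_avg)
```

Under the blank-counting rule the date counts as one word of ten characters, so dates and similar strings no longer pull the average down.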
select round(avg(length(sentence) / (1 + length(sentence) - length(replace(sentence," ","")))),2) as le,
       count(sentence) as cnt,
       source
from sentences s, inv_so i, sources so
where s.s_id=i.s_id and i.so_id=so.so_id
group by source
having cnt>=10
order by le
limit 30;
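For readers who want to reproduce the ranking outside the database, the following Python sketch mirrors the query's logic under stated assumptions. The names `chars_per_word` and `rank_sources` and the toy data are hypothetical. Two points are worth noting: the query divides the full sentence length, blanks included, by the word count, so the reported values are somewhat higher than the mean length of the words alone; and Python's `len` counts characters, whereas whether SQL `length` counts characters or bytes depends on the database and its encoding.

```python
from collections import defaultdict


def chars_per_word(sentence):
    """Sentence length (blanks included) divided by 1 + number of blanks."""
    return len(sentence) / (1 + sentence.count(" "))


def rank_sources(pairs, min_sentences=10, limit=30, descending=False):
    """Average chars_per_word per source, as the query does.

    pairs: iterable of (source, sentence) tuples.
    Returns rows of (average, sentence_count, source).
    """
    by_source = defaultdict(list)
    for source, sentence in pairs:
        by_source[source].append(chars_per_word(sentence))

    rows = [
        (round(sum(values) / len(values), 2), len(values), source)
        for source, values in by_source.items()
        if len(values) >= min_sentences  # corresponds to: having cnt>=10
    ]
    rows.sort(key=lambda row: row[0], reverse=descending)  # order by le
    return rows[:limit]  # limit 30


# Made-up toy data, for illustration only.
pairs = (
    [("A", "a bb ccc")] * 10
    + [("B", "dddd eeee")] * 10
    + [("C", "too few")] * 3
)
result = rank_sources(pairs)
print(result)
```

Calling `rank_sources(pairs, descending=True)` gives the reverse ordering, which corresponds to adding `desc` to the `order by` clause of the query.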
6.4.2.2 Average logarithmic word rank for different sources
6.4.2.3 Sources consisting of many / few words with frequency 1
6.4.2.4 Sources with low / high average word length of rare words